Univariate Plots Section
## [1] 4898 12
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Our data set consists of 4898 entries, each with 12 variables (columns).

The graphs above show that most wines have a quality of 5-7, and only a few wines are of quality 3 or 9. The second graph shows the same distribution on a logarithmic vertical scale.
In this project we will focus on how the different variables interact, in addition to the quality factor itself. However, the quality factor has too many classes, ranging from 3 to 9. To simplify it, I want to map the qualities into 3 distinct quality classes:
1. Lesser: qualities 3, 4, 5
2. Average: quality 6
3. Higher: qualities 7, 8, 9
Let’s also look into the distribution of the other variables with respect to quality.
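The three-class mapping can be sketched with base R's `cut()`. This is a minimal illustration, not the report's original code: the tiny `ww` data frame below is a made-up stand-in, and the column name `quality.class` is an assumption.

```r
# Made-up stand-in for the wine data frame; only the quality column matters here.
ww <- data.frame(quality = c(3L, 4L, 5L, 6L, 6L, 7L, 8L, 9L))

# cut() uses right-closed intervals by default:
# (2,5] -> Lesser, (5,6] -> Average, (6,9] -> Higher.
ww$quality.class <- cut(ww$quality,
                        breaks = c(2, 5, 6, 9),
                        labels = c("Lesser", "Average", "Higher"),
                        ordered_result = TRUE)

# Count how many wines fall into each class.
table(ww$quality.class)
```

Making the result an ordered factor keeps the classes in Lesser < Average < Higher order in tables and plot legends.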


The distribution looks approximately normal. There are a few outliers, some with very high values. To display the histogram more clearly, I filtered out the outliers by keeping only values between the 0.01 and 0.99 quantiles.
Below is a summary of the variable fixed.acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
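The quantile filter described above can be sketched as follows. The helper name `quantile_filter` and the synthetic values are assumptions for illustration; the report's own filtering code is not shown.

```r
# Hypothetical helper: keep only values between the lower and upper quantiles.
quantile_filter <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

set.seed(1)
# Synthetic stand-in values (not the real data) with one extreme outlier.
x <- c(rnorm(1000, mean = 6.8, sd = 0.8), 14.2)

filtered <- quantile_filter(x)            # drops roughly 1% in each tail
unfiltered <- quantile_filter(x, 0, 1)    # a 0 - 1 filter retains everything

c(original = length(x), filtered = length(filtered), unfiltered = length(unfiltered))
```

The filtered vector can then be passed to a histogram function such as `hist()`; the summaries in this report are computed on the unfiltered data, since their maxima include the outliers.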


The distribution looks approximately normal, with a slight skew to the right. We will look into volatile.acidity to see if it affects the quality. There are quite a few outliers, which could also be due to the variable being right-skewed. Many of the outliers seem to be in a reasonable range. To display the histogram more clearly, I filtered out the outliers by keeping only values between the 0.01 and 0.99 quantiles.
Below is a summary of the variable volatile.acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000


The distribution looks approximately normal. There are a few outliers, some with very high values. To display the histogram more clearly, I filtered out the outliers by keeping only values between the 0.01 and 0.99 quantiles. However, there seems to be a bump around 0.5. This could be due to acidity being reported with rounded values.
Below is a summary of the variable citric.acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600


The distribution is skewed to the right, but when plotted on a logarithmic scale it appears to be bimodal. There are very few outliers, some with very high values. Since I already plotted the histogram on a logarithmic x scale, I did not need to apply further quantile filtering. I used a quantile filter between 0 and 1, which retains all values.
Below is a summary of the variable residual.sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
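As a rough illustration of why a right-skewed variable can look bimodal on a logarithmic scale, the sketch below bins made-up values on both scales. The data are synthetic (a mix of two log-normal groups), not the real residual.sugar measurements.

```r
set.seed(7)
# Synthetic, made-up values: one group near 1.5 and another near 10.
sugar <- c(rlnorm(500, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(500, meanlog = log(10), sdlog = 0.4))

# On the raw scale the mean exceeds the median, a sign of right skew.
c(mean = mean(sugar), median = median(sugar))

# Bin the raw and the log10-transformed values without drawing the plots.
h_raw <- hist(sugar, breaks = 30, plot = FALSE)
h_log <- hist(log10(sugar), breaks = 30, plot = FALSE)

# Bin counts on the log scale; the two groups show up as separate humps.
h_log$counts
```

Setting `plot = TRUE` (the default) would draw the two histograms instead of only returning the bin counts.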


The distribution looks approximately normal. There are many outliers, which seem to form a uniform distribution between 0.1 and 0.2. To display the histogram more clearly, I filtered out the outliers by keeping only values between the 0.01 and 0.95 quantiles. The upper quantile is considerably lower than in the earlier plots because of this uniform region between 0.1 and 0.2.
Below is a summary of the variable chlorides.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600


The distribution looks approximately normal. There are a few outliers, some with very high values. To display the histogram more clearly, I filtered out the outliers by keeping only values between the 0.01 and 0.99 quantiles.
Below is a summary of the variable free.sulfur.dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00


The distribution looks approximately normal. There are a few outliers. To display the histogram more clearly, I filtered out the outliers by keeping only values between the 0.01 and 0.99 quantiles.
Below is a summary of the variable total.sulfur.dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0


The distribution looks approximately normal. There are very few outliers, some with very high values. To display the histogram more clearly, I filtered out the outliers by keeping only values between the 0.01 and 0.99 quantiles.
Below is a summary of the variable density.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390


The distribution looks approximately normal. There are many outliers, most with reasonable values. To display the histogram more clearly, I filtered out the outliers by keeping only values between the 0.01 and 0.99 quantiles.
Below is a summary of the variable pH.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820


The distribution looks approximately normal. There are many outliers. To display the histogram more clearly, I filtered out the outliers by keeping only values between the 0.01 and 0.99 quantiles.
Below is a summary of the variable sulphates.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800


The distribution looks approximately normal. There are no outliers. This is probably because wines have an accepted alcohol level range, which is between 8 and 14. Therefore, I used a quantile filter between 0 and 1, which retains all values.
Below is a summary of the variable alcohol.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Bivariate Plots Section
Let’s start with a test to see the correlation between all variables:
> The pairs alcohol - density(-0.8) and density - residual.sugar(0.8) seem to have exceptionally high correlations.
However, we are more concerned with the quality feature. Therefore, we will first look into how the other variables interact with quality.
The highest correlations with quality seem to be those of alcohol (0.4) and density (-0.3). The variables volatile.acidity (-0.2), chlorides (-0.2), and total.sulfur.dioxide (-0.2) also seem to be correlated with quality, albeit much more weakly.
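A correlation overview like the one described above can be sketched with base R's `cor()`. The data frame `ww_demo` and its values are synthetic assumptions used only to make the example runnable; the report's own correlation code and plot are not shown.

```r
set.seed(42)
n <- 200
# Synthetic stand-in; the values are made up and only mimic the column names.
alcohol <- runif(n, 8, 14)
density <- 1.01 - 0.002 * alcohol + rnorm(n, sd = 0.001)
quality <- round(pmin(9, pmax(3, 0.5 * alcohol + rnorm(n))))
ww_demo <- data.frame(alcohol, density, quality)

# Pairwise Pearson correlations, rounded to one decimal as in the text.
corr <- round(cor(ww_demo, method = "pearson"), 1)
corr

# Correlation of every other variable with quality, strongest first.
q <- corr["quality", colnames(corr) != "quality"]
q[order(abs(q), decreasing = TRUE)]
```

Sorting by absolute value is one way to pick the primary and secondary variables, since a strong negative correlation is as informative as a strong positive one.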
Primary Variables
- density
- alcohol


With a little jitter applied to quality, which is a discrete value, there seems to be a negative correlation between density and quality.
The boxplot shows this relationship better, with one distribution per quality level.
Below you can also find the numerical values for these box plots:
## ww$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0001
## --------------------------------------------------------
## ww$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0004
## --------------------------------------------------------
## ww$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0024
## --------------------------------------------------------
## ww$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## ww$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0004
## --------------------------------------------------------
## ww$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0006
## --------------------------------------------------------
## ww$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9897 0.9898 0.9903 0.9915 0.9906 0.9970
##
## Pearson's product-moment correlation
##
## data: density and quality
## t = -22.926, df = 4798, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3394837 -0.2884840
## sample estimates:
## cor
## -0.3142105
The correlation test shows that the correlation between these variables lies within [-0.34, -0.29] at a 95% confidence level.
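The per-quality summaries and the correlation test above can be reproduced in outline as follows. The `ww$quality: 3` headers are consistent with base R's `by()`, but the report's original code is not shown, so this is only a sketch on synthetic data; `ww_demo` and its values are assumptions.

```r
set.seed(3)
# Synthetic stand-in for the wine data; the values are made up.
quality <- sample(3:9, 300, replace = TRUE)
density <- 0.998 - 0.0008 * quality + rnorm(300, sd = 0.002)
ww_demo <- data.frame(quality, density)

# One summary per quality level, printed in blocks separated by dashes.
by(ww_demo$density, ww_demo$quality, summary)

# Pearson correlation test; conf.int holds the 95% confidence interval.
res <- cor.test(ww_demo$density, ww_demo$quality,
                method = "pearson", conf.level = 0.95)
round(res$estimate, 2)   # point estimate of the correlation
round(res$conf.int, 2)   # lower and upper bound of the interval
res$parameter            # degrees of freedom, i.e. observations minus 2
```

Reading the bounds from `res$conf.int` rather than retyping them from the printed output avoids transcription mistakes in the interval quoted in the text.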


With a little jitter applied to quality, which is a discrete value, there seems to be a positive correlation between alcohol and quality.
The boxplot shows this relationship better, with one distribution per quality level.
Below you can also find the numerical values for these box plots:
## ww$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## ww$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## ww$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## ww$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## ww$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## ww$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## ww$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 32.392, df = 4720, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4028330 0.4495123
## sample estimates:
## cor
## 0.4264566
The correlation test shows that the correlation between these variables lies within [0.40, 0.45] at a 95% confidence level.
Secondary Variables
- volatile.acidity
- chlorides
- total.sulfur.dioxide

##
## Pearson's product-moment correlation
##
## data: volatile.acidity and quality
## t = -11.36, df = 4778, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1896410 -0.1344307
## sample estimates:
## cor
## -0.1621628
The correlation test shows that the correlation between these variables lies within [-0.19, -0.13] at a 95% confidence level.

##
## Pearson's product-moment correlation
##
## data: chlorides and quality
## t = -17.025, df = 4790, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2653861 -0.2119856
## sample estimates:
## cor
## -0.2388664
The correlation test shows that the correlation between these variables lies within [-0.27, -0.21] at a 95% confidence level.

##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and quality
## t = -13.024, df = 4798, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2119747 -0.1573235
## sample estimates:
## cor
## -0.1847919
The correlation test shows that the correlation between these variables lies within [-0.21, -0.16] at a 95% confidence level.
Multivariate Plots Section
Let’s first look into the variables that we found to be strongly correlated with each other. I would also like to display their quality by applying a colour gradient to the data points.
## `geom_smooth()` using method = 'gam'

## `geom_smooth()` using method = 'gam'

In the density - alcohol graph, we can see that the different qualities are distributed fairly uniformly along the density axis (vertical). Along the alcohol axis (horizontal), however, the frequency of higher qualities increases. We already saw this relationship earlier, so I will not go into further detail.

In the density - residual.sugar graph, there seems to be a difference in how the qualities are distributed along the vertical axis (density). Perhaps this hints that higher-quality wines tend to have lower density. However, this could also be due to alcohol having an effect on density.

On the other hand, there seems to be a negative correlation between residual.sugar and alcohol. This makes sense: to reach higher alcohol levels, most of the sugar needs to be fermented into alcohol. The strange thing about all these graphs is the three-way relationship, which we will discuss later.
To eliminate the effect of density on alcohol and vice versa, I will repeat the first two plots above, dividing each variable by the other.


When we remove the residual.sugar effect from density by dividing density by the normalized (0-1) residual sugar values, the relationship with alcohol still seems to be there. Therefore, we can presume that most of the density - alcohol relationship is actually driven by alcohol. However, this relationship breaks down at higher alcohol levels, which hints that the effect of sugar on density becomes significant at these levels.
However, the relationship with residual.sugar is mostly lost when we remove the alcohol effect in the same way.
Strangely, however, this process created distinct groups of density / alcohol.normalized values. From elementary chemistry, we know the relationship between chlorides, pH and acidity, so I would like to draw a few plots regarding these variables.
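The division step can be sketched as below, assuming min-max scaling for the 0-1 normalization. The helper `normalize01`, the column names, and the four data rows are illustrative assumptions, not the report's code or data.

```r
# Hypothetical helper: min-max scaling to the 0-1 range.
normalize01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Made-up rows that only mimic the wine columns.
demo <- data.frame(density        = c(0.9910, 0.9940, 0.9980, 1.0010),
                   residual.sugar = c(1.2, 5.0, 12.0, 20.7),
                   alcohol        = c(12.8, 10.9, 9.6, 8.8))

demo$sugar.normalized  <- normalize01(demo$residual.sugar)
demo$density.per.sugar <- demo$density / demo$sugar.normalized

# The row holding the minimum sugar value normalizes to 0, so its ratio is Inf.
# Dropping it here is an assumption about how such rows could be handled.
demo_finite <- demo[is.finite(demo$density.per.sugar), ]
demo_finite
```

Because the ratio grows very quickly as the normalized value approaches 0, a logarithmic axis may be easier to read when plotting such a ratio against alcohol.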
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820

##
## Pearson's product-moment correlation
##
## data: pH and chlorides
## t = -6.3542, df = 4896, p-value = 2.285e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.11814666 -0.06259154
## sample estimates:
## cor
## -0.09043946
I am actually surprised that higher pH (lower acidity) does not correlate with a higher amount of chlorides, as I would have expected. Therefore, I would like to look into chlorides after eliminating the effect of acids on pH.

Strangely, removing the effect of fixed.acidity made the separation between the quality classes more visible. With this new graph we can say that higher-quality wines tend to have lower pH / fixed.acidity ratios.

The graph above shows a scatter plot of residual.sugar vs. chlorides. Unlike the lower-quality [3,6) wines, the higher-quality (6,9] wines tend to have low levels of residual.sugar up to a certain level of chlorides. Since chlorides are a salt component, their taste is possibly compensated for with added sugar.
Reflection
The data set has 12 attributes, each of which is quite meaningful, and it provides enough material to explore whether there are useful relationships within the samples. The quality attribute also provides the means to comment on the effects of the other variables on human perception of the product.
The data set has a sample size of 4898. This small sample size causes problems throughout the data exploration. The lack of enough samples for a particular subgroup makes it difficult to draw conclusions about some potential findings that would otherwise be significant.
The data set is complete and consistent: it does not seem to have entries with missing or meaningless values. This eases the data exploration. Throughout the analysis, the data showed consistent results in terms of completeness and distribution of data points.
The challenge for me during this data exploration was to understand the relationships between variables that wouldn't easily give up their secrets. I failed to find correlations where I expected them most. On the other hand, I encountered them in places where I least expected them. Even though most of the relationships could be more or less explained, they stayed mostly hidden until I uncovered them by visualizing them.
Even though finding a useful model or relationship for quality seems trivial at first, after a few steps it becomes apparent that quality is a much more complex phenomenon and that human perception depends on a more complex set of variables. At this point, it seems to me that it would be useful to know exactly what made the taster score a wine as high or low quality. For instance, it could have been noted that the taster liked its sweetness, acidity, alcohol ratio, etc.